This project explores the Red Wine Quality dataset to understand how the wine features relate to one another and to quality.
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## [13] "quality2"
## 'data.frame': 1599 obs. of 13 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : Ord.factor w/ 6 levels "3"<"4"<"5"<"6"<..: 3 3 3 4 3 3 3 5 5 3 ...
## $ quality2 : num 5 5 5 6 5 5 5 7 7 5 ...
This dataset has 1599 observations and 13 variables. Quality appears in two columns: quality is an ordered factor and quality2 is numeric.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol quality
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40 3: 10
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 4: 53
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20 5:681
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42 6:638
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 7:199
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90 8: 18
## quality2
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Median quality is 6, while mean quality is about 5.64.
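The median and mean of quality can be recomputed from the level counts printed in the summary above. This is only a sketch: `quality_counts` and `quality_num` are names introduced here for illustration, and the real analysis would read the values directly from `wine_data$quality2`.

```r
# Counts per quality level, copied from the summary() output above
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Rebuild a numeric quality vector with the same distribution
# (illustrative stand-in for wine_data$quality2)
quality_num <- rep(as.numeric(names(quality_counts)), times = quality_counts)

quality_median <- median(quality_num)
quality_mean <- mean(quality_num)

cat("n =", length(quality_num), "\n")
cat("median =", quality_median, "\n")
cat("mean =", round(quality_mean, 3), "\n")
```

Because quality only takes six integer values, the median has to be one of those integers, while the mean can fall between them.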
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Fixed acidity is approximately normally distributed. Values range from 4.6 to 15.9, and the peak of the distribution lies between 7 and 8.
Volatile acidity is approximately normally distributed. Values range from 0.12 to 1.58, and the most common values lie between 0.4 and 0.7.
Citric acid has a right-skewed distribution. Values range from 0 to 1.
Residual sugar has a right-skewed distribution. Values range from 0.9 to 15.5, and the most common values lie between 1.5 and 3.
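Chlorides look approximately normal around the peak, although the maximum (0.611) sits far above the third quartile (0.09), which points to a long right tail. Values range from 0.012 to 0.611, and the most common values lie between 0.05 and 0.1.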
Free sulfur dioxide has a right-skewed distribution. Values range from 1 to 72, and the most common values lie between 1 and 10.
Total sulfur dioxide has a right-skewed distribution. Values range from 6 to 289, and the most common values lie between 10 and 50.
Density is approximately normally distributed. Values range from 0.9901 to 1.0037, and the most common values lie between 0.995 and 1.
pH is approximately normally distributed. Values range from 2.74 to 4.01, and the most common values lie between 3.25 and 3.5.
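Sulphates look approximately normal around the peak, although the maximum (2.0) sits far above the third quartile (0.73), which points to a long right tail. Values range from 0.33 to 2, and the most common values lie between 0.5 and 0.75.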
Alcohol has a right-skewed distribution. Values range from 8.4 to 14.9, and the most common values lie between 9 and 10.
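The shape labels above come from reading the histograms. A rough numeric cross-check is to compare the mean with the median and to compute the sample skewness. The sketch below does not use the wine data: `sample_skewness`, `shape_summary`, and the two example vectors are hypothetical names and values introduced only for illustration.

```r
# Sample skewness: third standardized moment (no small-sample bias correction)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / (mean(centered^2))^1.5
}

# Mean, median and skewness side by side
shape_summary <- function(x) {
  c(mean = mean(x), median = median(x), skewness = sample_skewness(x))
}

# Illustrative stand-in values, NOT taken from the wine data
symmetric_example <- c(1, 2, 3, 4, 5)
right_skewed_example <- c(1, 1, 2, 2, 3, 15)

print(shape_summary(symmetric_example))
print(shape_summary(right_skewed_example))
```

With the real data the call would look like `shape_summary(wine_data$alcohol)`. A mean clearly above the median together with a positive skewness supports a "right-skewed" label, while values near zero are consistent with a roughly symmetric shape.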
Quality is categorical and ranges from 3 to 8. The most frequent value is 5 (681 wines), followed by 6 (638 wines).
I will recategorize quality as a rank: low (quality below 5), good (quality 5 or 6), and very good (quality above 6).
# Recategorize the quality as rank (low, good, very good)
wine_data$rank <- ifelse(wine_data$quality < 5, 'low',
                         ifelse(wine_data$quality < 7, 'good', 'very good'))
wine_data$rank <- ordered(wine_data$rank, levels = c('low', 'good', 'very good'))
summary(wine_data$rank)
## low good very good
## 63 1319 217
ggplot(data = wine_data, aes(x=rank, fill=rank)) +
geom_bar() + theme_minimal() +
scale_fill_brewer(type = 'seq', palette = 4)
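The same three-way split can also be written with `cut()` on a numeric quality column instead of nested `ifelse()` calls. This is an alternative sketch, not the method used in the report: `quality_num` is a hypothetical stand-in for `wine_data$quality2`, rebuilt here from the level counts in the summary output, and `rank_from_cut` is a name introduced for illustration.

```r
# Numeric quality rebuilt from the level counts in the summary() output
# (illustrative stand-in for wine_data$quality2)
quality_num <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Intervals are right-closed by default: (-Inf, 4], (4, 6], (6, Inf)
rank_from_cut <- cut(
  quality_num,
  breaks = c(-Inf, 4, 6, Inf),
  labels = c("low", "good", "very good"),
  ordered_result = TRUE
)

print(table(rank_from_cut))
```

Keeping the break points in one vector makes the category boundaries easier to audit, and `ordered_result = TRUE` returns an ordered factor directly.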
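This dataset has 1599 observations and 12 original variables. The data frame shows 13 columns because quality is stored twice, as the ordered factor quality and as the numeric quality2.
Quality is the main feature of interest.
I think alcohol, pH, volatile acidity and total sulfur dioxide will help support the investigation of quality.
Yes, I created a rank variable to recategorize quality as (low, good, very good).
The dataset is already tidy, so I did not transform it and used it as it is.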
## Warning in ggscatmat(wine_data, columns = 1:13): Factor variables are
## omitted in plot
The higher the absolute value of the correlation coefficient, the stronger the linear relationship between the two variables. In the matrix, the largest absolute correlation is between pH and fixed.acidity (-0.68). The next strongest are density and fixed.acidity (0.67) and citric.acid and fixed.acidity (0.67).
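Reading the strongest pair off a scatterplot matrix by eye can be error-prone, so one option is to search the correlation matrix directly. The sketch below is a minimal illustration on toy data: `strongest_pair` and the `toy` data frame are hypothetical and are not part of the original analysis.

```r
# Return the pair of numeric columns with the largest absolute Pearson correlation
strongest_pair <- function(df) {
  numeric_df <- df[sapply(df, is.numeric)]
  corr <- cor(numeric_df, use = "pairwise.complete.obs")
  corr[lower.tri(corr, diag = TRUE)] <- NA  # keep each pair once, drop the diagonal
  idx <- which(abs(corr) == max(abs(corr), na.rm = TRUE), arr.ind = TRUE)[1, ]
  data.frame(
    var1 = rownames(corr)[idx[["row"]]],
    var2 = colnames(corr)[idx[["col"]]],
    r = corr[idx[["row"]], idx[["col"]]]
  )
}

# Toy data, NOT the wine data: b falls as a rises, c is only loosely related
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(12, 10, 9, 7, 4, 2),
  c = c(3, 1, 4, 1, 5, 9),
  grp = factor(c("x", "y", "x", "y", "x", "y"))  # non-numeric columns are skipped
)

result <- strongest_pair(toy)
print(result)
```

On the real data the call would be `strongest_pair(wine_data)`. The factor columns are dropped automatically, in the same way that `ggscatmat` omits them.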
# Scatter plot with a linear fit
ggplot(aes(x = pH, y = fixed.acidity), data = wine_data) +
geom_point(position = position_jitter(h = 0), color="purple") +
stat_smooth(method = 'lm')+
labs(title="pH VS Fixed acidity",
x="pH", y ="Fixed acidity")
The figure above shows a strong negative relationship between pH and fixed.acidity.
# Scatter plot with a linear fit
ggplot(aes(x = density, y = fixed.acidity), data = wine_data) +
geom_point(position = position_jitter(h = 0), color="purple") +
stat_smooth(method = 'lm') +
labs(title="Density VS Fixed acidity",
x="Density", y ="Fixed acidity")
The figure above shows a strong positive relationship between density and fixed.acidity.
# Scatter plot with a linear fit
ggplot(aes(x = citric.acid, y = fixed.acidity), data = wine_data) +
geom_point(position = position_jitter(h = 0), color="purple") +
stat_smooth(method = 'lm') +
labs(title="Citric acid VS Fixed acidity",
x="Citric acid", y ="Fixed acidity")
The figure above shows a strong positive relationship between citric.acid and fixed.acidity.
Of all pairs of variables, pH and fixed.acidity have the strongest relationship.
From the matrix, I noticed that the factor most strongly related to quality is “alcohol”, with a correlation coefficient of about 0.48.
The strongest positive relationship is between fixed.acidity and citric.acid (0.67, tied with density and fixed.acidity). By absolute value, the strongest relationship overall is between pH and fixed.acidity (-0.68).
# Box plot
ggplot(aes(x = quality, y = alcohol), data = wine_data) +
geom_point(aes(color = rank, fill = rank), position = position_jitter(h = 0)) +
geom_boxplot(alpha = 0.5) +
scale_colour_brewer(palette=3)+
labs(title="Quality VS Alcohol",
x="Quality", y ="Alcohol")
The figure above shows that most observations fall in quality levels 5 to 7, and that alcohol tends to increase with quality.
# Box plot
ggplot(aes(x = quality, y = volatile.acidity), data = wine_data) +
geom_point(aes(color = rank, fill = rank), position = position_jitter(h = 0)) +
geom_boxplot(alpha = 0.5) +
scale_colour_brewer(palette=3)+
labs(title="Quality VS Volatile acidity",
x="Quality", y ="Volatile acidity")
The figure above shows that most observations fall in quality levels 5 to 7, and that volatile acidity tends to decrease as quality increases.
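The direction read off the two box plots above can also be summarised numerically, using the median of the feature within each quality level and a rank correlation. The sketch below uses toy values only: the `toy` data frame, `medians_by_quality`, and `rho` are hypothetical and do not come from the wine data.

```r
# Toy data, NOT the wine data: an ordered quality factor and a numeric feature
toy <- data.frame(
  quality = factor(c(5, 5, 5, 6, 6, 6, 7, 7, 7), levels = 5:7, ordered = TRUE),
  alcohol = c(9.2, 9.5, 9.8, 10.1, 10.6, 11.0, 11.2, 11.9, 12.4)
)

# Median of the feature within each quality level (the centre line of each box)
medians_by_quality <- tapply(toy$alcohol, toy$quality, median)
print(medians_by_quality)

# Spearman rank correlation suits an ordinal variable;
# as.integer() on an ordered factor gives the position of each level
rho <- cor(as.integer(toy$quality), toy$alcohol, method = "spearman")
print(rho)
```

Medians that rise from one level to the next, together with a positive rank correlation, point to a positive relationship. Falling medians and a negative value would point the other way, as described for volatile acidity.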
# Scatter plot colored by rank, with a linear fit
ggplot(aes(x = pH, y = fixed.acidity), data = wine_data) +
geom_point(aes(color = rank, fill = rank), position = position_jitter(h = 0)) +
stat_smooth(method = 'lm')+
labs(title="pH VS Fixed acidity",
x="pH", y ="Fixed acidity")
The figure above shows a strong negative relationship between pH and fixed acidity, and most wines fall in the “good” rank.
# Scatter plot colored by rank, with a linear fit
ggplot(aes(x = density, y = fixed.acidity), data = wine_data) +
geom_point(aes(color = rank, fill = rank), position = position_jitter(h = 0)) +
stat_smooth(method = 'lm') +
labs(title="Density VS Fixed acidity",
x="Density", y ="Fixed acidity")
The figure above shows a strong positive relationship between density and fixed acidity. The trend appears most clearly among wines in the “good” rank, which is also the largest group.
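Whether one rank shows a stronger relationship than the others can be checked by computing the correlation separately inside each rank. The sketch below is a hypothetical illustration on toy values: the `toy` data frame and `cor_by_rank` are not part of the original analysis, and the numbers are not from the wine data.

```r
# Toy data, NOT the wine data: two numeric features and a three-level rank
toy <- data.frame(
  density = c(0.994, 0.995, 0.996, 0.997,
              0.995, 0.996, 0.997, 0.998,
              0.996, 0.997, 0.998, 0.999),
  fixed.acidity = c(6.0, 7.5, 6.8, 8.1,
                    6.5, 7.2, 8.0, 8.9,
                    7.9, 7.0, 9.5, 8.8),
  rank = factor(rep(c("low", "good", "very good"), each = 4),
                levels = c("low", "good", "very good"), ordered = TRUE)
)

# Pearson correlation computed separately inside each rank
cor_by_rank <- sapply(
  split(toy, toy$rank),
  function(d) cor(d$density, d$fixed.acidity)
)
print(round(cor_by_rank, 2))

# Group sizes matter: a rank with few wines gives an unstable estimate
print(table(toy$rank))
```

In the wine data the ranks are very unequal in size (63, 1319 and 217 wines), so a visually clearer trend in the largest group may partly reflect that it simply has more points.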
# Scatter plot colored by rank, with a linear fit
ggplot(aes(x = citric.acid, y = fixed.acidity), data = wine_data) +
geom_point(aes(color = rank, fill = rank), position = position_jitter(h = 0)) +
stat_smooth(method = 'lm') +
labs(title="Citric acid VS Fixed acidity",
x="Citric acid", y ="Fixed acidity")
The figure above shows a strong positive relationship between citric.acid and fixed acidity.
# Recategorize the quality as rank (low, good, very good)
wine_data$rank <- ifelse(wine_data$quality < 5, 'low',
                         ifelse(wine_data$quality < 7, 'good', 'very good'))
wine_data$rank <- ordered(wine_data$rank, levels = c('low', 'good', 'very good'))
summary(wine_data$rank)
## low good very good
## 63 1319 217
ggplot(data = wine_data, aes(x=rank, fill=rank)) +
geom_bar() + theme_minimal() +
scale_fill_brewer(type = 'seq', palette = 4)
Since quality has 6 levels recorded as numbers, I decided to recategorize them into three clearer categories (low, good, very good). This figure shows the number of wines in each category.
# Scatter plot with a linear fit
ggplot(aes(x = pH, y = fixed.acidity), data = wine_data) +
geom_point(position = position_jitter(h = 0), color="purple") +
stat_smooth(method = 'lm')
The correlation coefficient between pH and fixed acidity is negative, which indicates an inverse relationship. It also has the highest absolute value in the matrix, which indicates that this is the strongest relationship.
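The line drawn by `stat_smooth(method = 'lm')` is an ordinary least squares fit, so its slope and the correlation coefficient always share the same sign. The sketch below illustrates this on toy values: the `toy` data frame and the names `fit`, `slope` and `r` are hypothetical and the numbers are not from the wine data.

```r
# Toy data, NOT the wine data: fixed acidity falling as pH rises
toy <- data.frame(
  pH = c(3.0, 3.1, 3.2, 3.3, 3.4, 3.5),
  fixed.acidity = c(10.8, 9.9, 9.1, 8.0, 7.4, 6.3)
)

# Ordinary least squares fit of fixed.acidity on pH
fit <- lm(fixed.acidity ~ pH, data = toy)
slope <- unname(coef(fit)["pH"])
r <- cor(toy$pH, toy$fixed.acidity)

cat("slope:", round(slope, 2), "\n")
cat("r:", round(r, 2), "\n")

# For a one-predictor linear fit, R-squared equals the squared correlation
cat("R-squared:", round(summary(fit)$r.squared, 3), "\n")
```

For the wine data, the equivalent call would be `lm(fixed.acidity ~ pH, data = wine_data)`. With the reported correlation of -0.68, the squared value suggests that pH accounts for a bit under half of the variance in fixed acidity in a linear fit.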
# Scatter plot colored by rank, with a linear fit
ggplot(aes(x = pH, y = fixed.acidity), data = wine_data) +
geom_point(aes(color = rank, fill = rank), position = position_jitter(h = 0)) +
stat_smooth(method = 'lm')+
labs(title="pH VS Fixed acidity",
x="pH", y ="Fixed acidity")
The correlation coefficient between pH and fixed acidity is negative, which indicates an inverse relationship. It also has the highest absolute value in the matrix, which indicates that this is the strongest relationship. As the figure above shows, most red wines are in the “good” rank.
First, I chose the red wine dataset, which has 1599 wines and 12 variables. I started by exploring and visualizing every variable to understand the data. Then I tried to answer the questions using analysis, visualization and the correlation matrix. I looked for the relationships between factors: which is the strongest, which are positive and which are negative. The strongest relationship is between pH and fixed acidity, and it is negative. I also explored which factors most affect wine quality: the most influential is alcohol, followed by volatile acidity. I enjoyed this analysis, and I think that with a larger dataset we could find more.